
Conversation

cbuescher
Member

@cbuescher cbuescher commented Sep 22, 2025

This change improves point-in-time (PIT) context relocations in cases where we
lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry
shards whose open PIT contexts have been lost by re-creating identical reader
contexts on either the same or a different node for read-only and frozen indices
(via FrozenEngine / ReadOnlyEngine).
However, we currently re-create these contexts with every subsequent search
because we don't store the recreated context in the same way as when we open
a new PIT. This PR avoids this extra work by making the following changes:

  • map ReaderContexts held in SearchService by more than just an auto-incremented Long
    id, which can clash between different nodes. Instead, this PR changes the lookup to use
    the full ShardSearchContextId

  • re-write the PIT id we return with every search response in cases where the results
    we receive show that the shards involved in answering the PIT request have been
    served by a different node than the one originally encoded in the PIT id

The latter is possible because the original PIT design already provided hooks
for changing the PIT id between search requests, and we already document that
open point-in-time ids can change between requests.
Currently, without re-writing the PIT ids, we can get into situations where we
lose a shard PIT context due to a node restart but are able to re-create it on
another node. If the shards in the group got reassigned to nodes different from
the original one, subsequent requests can get routed for retries to alternating
nodes. This doesn't matter today, where we throw away each re-created PIT
context after use, but with this improvement, which tracks these contexts on
the new nodes again, we want to make sure that subsequent searches go to the
same node. A minimal sketch of the re-encoding idea follows below.
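
To make the re-encoding idea concrete, here is a minimal, self-contained sketch. All type and method names (ShardId, ContextIdForNode, ShardResult, PitReEncoder.maybeReEncode) are invented stand-ins for this illustration and not the actual Elasticsearch classes; the real change lives in the maybeReEncodeNodeIds method discussed in the review below.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins for the real SearchContextIdForNode / ShardSearchContextId classes.
record ShardId(String index, int shard) {}
record ContextIdForNode(String nodeId, String sessionId, long contextId) {}
record ShardResult(ShardId shardId, ContextIdForNode servedBy) {}

final class PitReEncoder {

    /**
     * Re-encode the per-shard part of a PIT id: every shard that was answered by a
     * different node than the one stored in the original PIT is pointed at the new
     * node; entries without a result (failed or skipped shards) are copied unchanged.
     */
    static Map<ShardId, ContextIdForNode> maybeReEncode(
        Map<ShardId, ContextIdForNode> original,
        Iterable<ShardResult> results
    ) {
        Map<ShardId, ContextIdForNode> updated = new HashMap<>(original);
        for (ShardResult result : results) {
            ContextIdForNode before = original.get(result.shardId());
            // only rewrite entries whose serving node actually changed
            if (before != null && before.nodeId().equals(result.servedBy().nodeId()) == false) {
                updated.put(result.shardId(), result.servedBy());
            }
        }
        return updated;
    }
}
```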

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations label Sep 22, 2025
@elasticsearchmachine
Collaborator

Hi @cbuescher, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@cbuescher cbuescher force-pushed the improve-pit-relocation branch from 3b233da to 730efd4 on September 23, 2025 18:02
@cbuescher
Member Author

Hi @dnhatn, can I ask you for a review on this, since you are probably still the one most familiar with the PIT id rewrite hook and the general current retry logic for frozen and read-only shards?

@cbuescher cbuescher requested a review from dnhatn September 23, 2025 18:04
@dnhatn
Member

dnhatn commented Sep 24, 2025

@cbuescher Sure, I'll take a look today or tomorrow.

@cbuescher cbuescher force-pushed the improve-pit-relocation branch from beb3f48 to fb526d4 on September 25, 2025 13:20
@dnhatn
Member

dnhatn commented Sep 26, 2025

map ReaderContexts held in SearchService by more than just an auto-incremented Long
id that can clash between different nodes. Instead this PR changes the lookup to use the full
ShardSearchContextId

I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.
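
For illustration, a rough sketch of that mechanism: a node-local session id (regenerated on every node start) combined with an auto-incremented counter, so the counter alone can collide across nodes while the pair stays unique. The names below (ContextKey, ReaderContextRegistry) are made up for this sketch and don't mirror the real SearchService / ShardSearchContextId code.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a per-node registry whose keys combine a node-local session id
// (regenerated on every node start) with an auto-incremented counter.
record ContextKey(String sessionId, long id) {}

final class ReaderContextRegistry {
    private final String sessionId = UUID.randomUUID().toString(); // differs per node and per restart
    private final AtomicLong idGenerator = new AtomicLong();
    private final Map<ContextKey, Object> activeContexts = new ConcurrentHashMap<>();

    ContextKey register(Object readerContext) {
        // The long counter alone can collide with a counter on another node;
        // the (sessionId, id) pair is what makes the key unique across nodes.
        ContextKey key = new ContextKey(sessionId, idGenerator.incrementAndGet());
        activeContexts.put(key, readerContext);
        return key;
    }

    Object lookup(ContextKey key) {
        return activeContexts.get(key);
    }
}
```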

Member

@dnhatn dnhatn left a comment

Thanks, @cbuescher. I've left some questions and comments, but I think re-encoding the PIT when the target nodes have changed is a good idea.

.setPointInTime(new PointInTimeBuilder(pitId).setKeepAlive(TimeValue.timeValueMinutes(2))),
resp -> {
assertHitCount(resp, docCount);
updatedPit.set(resp.pointInTimeId());
Member

Can we verify that we have updated the PIT?

Member Author

Good point. However, it's not impossible that the PIT id doesn't change. The test restarts all nodes. If all shards in the original PIT get allocated to their original nodes again, we will reopen the context in the same location, leaving everything in that particular SearchContextIdForNode the same. When running tests I definitely saw both scenarios.

}
}

static <Result extends SearchPhaseResult> BytesReference maybeReEncodeNodeIds(
Member

Question: I haven't looked into how we handle partial results with re-encoding. Have you considered this?

Member Author

I assume you are referring to the functionality added with #111516? If I understand that PR correctly, we will have SearchContextIdForNode entries in the PIT with a "null" node entry. I think in that case we won't add that shard to the shard iterators of any subsequent search, so we won't get a Result for that shard. That is one reason why I added copying PIT id entries for everything that has no Result from the old to the new encoded id without change.
Is that what you mean?

Member Author

I looked again at how we handle partial results. The way I see it, when opening the PIT we can tolerate partial results, i.e. failures from certain shards when they are not available. The result will be an entry in the PIT id that has a ShardId but a SearchContextIdForNode with an empty node/contextId.
We should not change any of these entries, which is what is already happening in this method.
In addition, the re-encoding step here doesn't change any entries for shards that failed in the last search. They shouldn't be included in the "results" list and for that reason won't update the related part in the updated PIT id. In cases where these failures are temporary, subsequent searches with the updated id will try to hit the "old" shard context locations; if any of these can be re-tried we will update that part of the PIT in a later call.
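
A small illustrative sketch of that rule, with invented names, assuming the re-encoding step only rewrites entries that have a concrete node and produced a result in the last search:

```java
// Illustrative only: which PIT entries are candidates for re-encoding.
record NodeContext(String nodeId, Long contextId) {}

final class ReEncodeFilter {
    /**
     * Entries recorded with a null node (the shard was unavailable when the PIT was
     * opened with partial results allowed) and shards that produced no result in the
     * last search keep their original entry; only shards that answered from a concrete
     * node are candidates for rewriting.
     */
    static boolean eligibleForReEncoding(NodeContext entry, boolean hasResultFromLastSearch) {
        return entry != null && entry.nodeId() != null && hasResultFromLastSearch;
    }
}
```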

updatedShardMap.put(shardId, original.shards().get(shardId));
}
}
return SearchContextId.encode(updatedShardMap, original.aliasFilter(), minTransportVersion, failures);
Member

Another question is how to clean up search contexts from old PITs. For example, when a shard-level request fails, we try to execute it on another copy. In these cases, we re-encode the PIT. If users close the new PIT, the old search context won't be closed. This is not an issue with stateful, but it can be a problem with serverless. Should we close the old search contexts here?

Member Author

You mean shard-level requests that are not related to the search context missing? I didn't consider this yet, good point I think. I was assuming that if we get a result from another node than the original one, that would always mean the old context is gone, but you are right, we might retry elsewhere for other reasons.
Would you suggest something like the fire-and-forget approach we use e.g. here in TransportSearchAction from this location? I assume a close request is relatively cheap even if the old context doesn't exist any longer, and we can treat this as a best-effort attempt. If this fails at some point, the context reaper process should clean up anything that's over the keepalive limit, no?

Member Author

I just see that TransportSearchAction goes to all shards; it would probably need to be something more selective that targets only the shards that changed location and uses ClearScrollController#closeContexts directly.
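
A hedged sketch of what such a selective, best-effort cleanup could look like; all names here are illustrative and this is not the actual ClearScrollController or TransportSearchAction API:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative stand-ins; not the real ShardId / ShardSearchContextId classes.
record ShardKey(String index, int shard) {}
record NodeContextId(String nodeId, long contextId) {}

final class StaleContextCloser {

    interface ContextCloser {
        void close(NodeContextId staleContext); // fire-and-forget close on the old node
    }

    /**
     * Compare the PIT entries before and after re-encoding and issue a best-effort
     * close for every context whose location changed. Failures are ignored: the
     * context reaper eventually removes anything past its keep-alive anyway.
     */
    static void closeRelocatedContexts(
        Map<ShardKey, NodeContextId> before,
        Map<ShardKey, NodeContextId> after,
        ContextCloser closer
    ) {
        List<NodeContextId> stale = before.entrySet().stream()
            .filter(e -> {
                NodeContextId updated = after.get(e.getKey());
                return updated != null && updated.nodeId().equals(e.getValue().nodeId()) == false;
            })
            .map(Map.Entry::getValue)
            .collect(Collectors.toList());
        stale.forEach(closer::close);
    }
}
```

Because the close is fire-and-forget, a failure only means the stale context lingers until the reaper removes it after the keep-alive expires.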

@cbuescher
Member Author

@dnhatn thanks a lot for the review. I addressed some of your comments and left questions otherwise. I added closing logic in the re-encoding step and adapted the tests accordingly; however, I'm not sure that is exactly what you suggested. Let me know what you think.

About your comment above:

map ReaderContexts held in SearchService by more than just an auto-incremented Long
id that can clash between different nodes. Instead this PR changes the lookup to use the full
ShardSearchContextId

I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.

This was partially motivated by other planned changes that will enable re-locating PITs between nodes even without retries triggered by subsequent search requests. I will need to think this over a bit more and might get back to you with more questions to deepen my understanding of how sessionIds are currently used.

@dnhatn dnhatn self-requested a review September 29, 2025 17:56
@dnhatn
Member

dnhatn commented Sep 29, 2025

@cbuescher Please let me know when it’s ready for another review. Thank you for working on it.

@cbuescher
Member Author

@dnhatn thanks for the feedback, I added some changes around ignoring failures when re-encoding the PIT and left a small comment regarding partial failures in #135231 (comment).
Before this is ready for another review I want to gate some of the new behavior behind a feature flag; I will ping you once that's done.

elasticsearchmachine and others added 3 commits October 8, 2025 21:33